Linear Regression

Beyond the resources you have already been exposed to, another way to gain a more intuitive understanding of linear regression is to build the algorithm from scratch, without the assistance of sklearn. Doing so will help you build an intuition for how linear regression works and take you from simply knowing how to use sklearn to actually understanding what you are doing, so you can customize your solution to the problem you are dealing with.

One of the best video tutorials on linear regression is Andrew Ng's Machine Learning lecture series. We recommend watching his lectures to gain an even deeper understanding of linear regression, either before or after going through this notebook. You will find a link to his videos at the end of this notebook.

What is Linear Regression

Linear regression is a statistical way of measuring the relationship between variables. We can use it to predict an outcome variable based on its features. The outcome is what we are trying to predict, while the features are the other variables in our data, excluding the outcome variable. Looking at our data, we try to find the line that best explains the relationship between these factors.

How we do this can be explained by the formula y = mx + b

In this context, y is the outcome variable we are trying to predict and m is the slope: the rate of change in our outcome variable as x changes.

x, on the other hand, represents the feature value we feed in. For example, if we have one feature and want to predict one outcome variable, x would be the value of the feature and y would be the outcome we expect given that value of x.

b represents the bias or intercept; this is where our line crosses the y-axis.

Intuitive explanation

Let's assume we want to predict the price of houses based on the size of each house. We have a scatterplot with the y-axis representing the price of the house and the x-axis representing the size of the house. With simple linear regression in this scenario, we would like to predict the price of a house from the value on the x-axis. We simply draw the line through the data that best fits the data points we have, and use this line to predict the value of a house.

Now let's look at how we would create this function in Python with a simple example:

We would like to predict the price of a house based on the number of rooms in it. In this context, let us assume that we already know the values of m and b, which are 182995 and 232995 respectively. Let us try to predict the price of houses with 2, 7 and 12 rooms.

In [1]:
from IPython.display import HTML
In [2]:
# let's look at how we would create this function in Python
def simple_linear_regression(slope, val, intercept):
    return (slope * val) + intercept
print('We predict that the price of a 2 roomed house is: R {:,.2f}'.format(simple_linear_regression(182995,2,232995)))
print('We predict that the price of a 7 roomed house is: R {:,.2f}'.format(simple_linear_regression(182995,7,232995)))
print('We predict that the price of a 12 roomed house is: R {:,.2f}'.format(simple_linear_regression(182995,12,232995)))      
We predict that the price of a 2 roomed house is: R 598,985.00
We predict that the price of a 7 roomed house is: R 1,513,960.00
We predict that the price of a 12 roomed house is: R 2,428,935.00

What if you have multiple features/variables?

The formula changes slightly when we want to predict the price of a house using multiple variables. Instead of simply multiplying x by m and adding b, we carry out this step for each variable and then add up the totals to find the outcome.

$y = m_{0} + m_{1}x_{1} + m_{2}x_{2} + \dots + m_{n}x_{n} + e$

We start off by taking the value of $m_{0}$, which represents the value of the y-intercept when all the parameters are set to zero. For each variable we then find its slope, multiply the two together, add up the $mx$ values for all our variables, and finally add both our intercept and the model error $e$ (which represents the amount of variation in our estimate of y).
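The formula above can be sketched in Python with NumPy's dot product. The coefficients and feature values below are made-up numbers purely for illustration, and the error term is left out since we do not know it at prediction time:

```python
import numpy as np

def multiple_linear_regression(intercept, slopes, features):
    # y = m0 + m1*x1 + m2*x2 + ... + mn*xn
    return intercept + np.dot(slopes, features)

# hypothetical coefficients: one slope per feature (rooms, size in square feet)
slopes = np.array([182995.0, 120.0])
features = np.array([3.0, 1500.0])
price = multiple_linear_regression(232995.0, slopes, features)
print('Predicted price: R {:,.2f}'.format(price))
```

Note that simple linear regression is just the special case where `slopes` and `features` each hold a single value.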

Cost functions

Remember that the core focus of machine learning is to learn from the data we expose a model to. We have all been children at some point in our lives. As a child you learnt the dangers of a stove either by being told about it and avoiding it or, if you were on the more adventurous and rebellious side, by touching the stove and getting burnt.

I, for example, learnt about the dangers of playing with fire by stepping on hot ash barefoot. The pain helped me modify my behaviour. Firstly, it helped me understand that fire is not something to be played with; it also taught me to wear sandals or shoes when going close to an outdoor fireplace on a windy day, or simply to be more aware of my surroundings. We can liken this to a cost function that helped me minimize my mistakes, at least in the context of outdoor fireplaces.

Similarly, in machine learning we have cost functions, which measure how wrong a model is in its ability to estimate the relationship between one or more features and an outcome. You can think of this as the distance between the actual outcome and what we predict. In machine learning we therefore seek to minimize this cost function. One common cost function is the mean squared error (MSE), which gives you the mean of the squared differences between our predictions and the actual outcome values. Minimizing this metric gives us the line of best fit. So how do we minimize the cost function?
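As a quick illustration, here is a minimal sketch of the MSE calculation, using made-up actual and predicted house prices:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # average of the squared differences between actuals and predictions
    errors = y_true - y_pred
    return np.mean(errors ** 2)

# hypothetical actual and predicted prices for three houses
actual = np.array([300000.0, 450000.0, 600000.0])
predicted = np.array([320000.0, 440000.0, 590000.0])
mse = mean_squared_error(actual, predicted)
```

Because the errors are squared, large misses are penalised much more heavily than small ones, which is one reason MSE is a popular choice for regression.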

Minimizing the cost function

There are quite a few algorithms that minimize the cost. One of these is Gradient Descent.

According to Hands on Machine Learning, "the general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function".

What does this mean? Well, for starters, gradient descent is not specific to solving linear regression problems.

Let's assume you are in Cape Town and, for some strange reason, walking down Table Mountain blindfolded; maybe you took a bet and have an incredibly high risk tolerance and very little regard for your limbs and life.

You ultimately want to reach the bottom of the mountain. One way to increase your odds of getting down as fast as possible is to feel the ground in every direction and take a step in the direction where the ground descends the fastest. The pace at which you descend the terrain, however, also depends on the size of your steps. If your steps are too small, you will probably need many steps to get to the bottom. On the other hand, if you leap with each step, you may overshoot and possibly find yourself higher than you were before. You can think of the size of the steps as the learning rate.

You would then continue carrying out this process until you hopefully get to the bottom of the mountain in one or multiple pieces.
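The descent above can be sketched in code for a simple one-dimensional cost function. Here we use the hypothetical cost f(m) = (m - 3)², whose gradient is 2(m - 3) and whose minimum sits at m = 3; each step moves against the gradient, scaled by the learning rate:

```python
def descend(gradient, start, learning_rate, steps):
    # repeatedly step downhill in the direction of the negative gradient
    position = start
    for _ in range(steps):
        position -= learning_rate * gradient(position)
    return position

# cost f(m) = (m - 3)**2 has gradient 2*(m - 3); its minimum is at m = 3
minimum = descend(lambda m: 2 * (m - 3), start=0.0, learning_rate=0.1, steps=100)
```

After 100 steps the position has converged very close to 3, the bottom of this "mountain".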

In [3]:
#Gradient Descent Intuition by Andrew Ng
from IPython.display import YouTubeVideo
YouTubeVideo('rIVLE3condE')
Out[3]:

There are actually quite a few variants of Gradient Descent. Let's look at some of them.

Batch Gradient Descent

This is a variation of gradient descent that calculates the error for each example in the training dataset but only updates the model after all the training examples have been evaluated. This makes the model more computationally efficient than, say, stochastic gradient descent, as the frequency of updates is reduced. However, it may result in slower model updates and training speed on large datasets.

Stochastic Gradient Descent

This variation of gradient descent calculates the gradient for each update using only a single training data point that is chosen at random. This allows each update to be faster to calculate than batch gradient descent especially as the number of updates increases.

Mini-batch gradient descent

With this variation we calculate the gradient for each small mini-batch of our training data. We start off by dividing our training data into small batches and then proceed to perform one update per mini batch.
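As a rough sketch of that batching step (the toy data and batch size below are arbitrary), we can shuffle the training data once and slice it into mini-batches; one gradient update would then be performed per batch:

```python
import numpy as np

def mini_batches(X, y, batch_size, seed=0):
    # shuffle once, then yield successive (X, y) slices of batch_size rows
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# toy dataset: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
batches = list(mini_batches(X, y, batch_size=4))
# 10 samples with batch_size 4 gives batches of sizes 4, 4 and 2
```

Setting `batch_size=1` recovers stochastic gradient descent, while `batch_size=len(y)` recovers batch gradient descent.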

How do we choose the right learning rate?

We typically pick a small value (e.g. 0.1, 0.01 or 0.001) and adapt it based on whether the cost function is decreasing too slowly or too erratically.
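To see why, here is a toy comparison on a simple one-dimensional cost, f(m) = (m - 3)² with gradient 2(m - 3): a very small learning rate leaves us far from the minimum after a fixed number of steps, a moderate one converges, and a very large one overshoots and diverges (the specific values here are illustrative only):

```python
def final_cost(learning_rate, steps=50):
    # minimise f(m) = (m - 3)**2 starting from m = 0 with a fixed learning rate
    m = 0.0
    for _ in range(steps):
        m -= learning_rate * 2 * (m - 3)
    return (m - 3) ** 2

# too small converges slowly, a middle value works well, too large diverges
costs = {alpha: final_cost(alpha) for alpha in (0.001, 0.1, 1.5)}
```

Plotting the cost history for each candidate learning rate, as we do later in this notebook, is the usual way to diagnose which regime you are in.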

Additional Reading:

Now let's try to implement linear regression from scratch:

In [4]:
import pandas as pd
import numpy as np
import matplotlib.animation as animation
import matplotlib.pyplot as plt
path = 'C://Users//User//Downloads//us_housing_data.csv'
file = pd.read_csv(path)

Let's look at the first 5 rows of our data to better understand our dataset and the data types in each column

In [5]:
file.head()
Out[5]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view ... grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.0 3 1.00 1180 5650 1.0 0 0 ... 7 1180 0 1955 0 98178 47.5112 -122.257 1340 5650
1 6414100192 20141209T000000 538000.0 3 2.25 2570 7242 2.0 0 0 ... 7 2170 400 1951 1991 98125 47.7210 -122.319 1690 7639
2 5631500400 20150225T000000 180000.0 2 1.00 770 10000 1.0 0 0 ... 6 770 0 1933 0 98028 47.7379 -122.233 2720 8062
3 2487200875 20141209T000000 604000.0 4 3.00 1960 5000 1.0 0 0 ... 7 1050 910 1965 0 98136 47.5208 -122.393 1360 5000
4 1954400510 20150218T000000 510000.0 3 2.00 1680 8080 1.0 0 0 ... 8 1680 0 1987 0 98074 47.6168 -122.045 1800 7503

5 rows × 21 columns

Great! Now we know that we have 21 columns. We are leaving out the id column because we cannot learn anything valuable from it that can be generalized to data our model is not exposed to. We will assume that there isn't any value we can get from the date, lat and long columns either. We know that our outcome variable is continuous, which means we are dealing with a regression problem.

Now we need to start off by splitting our data into two: X (our features) and y (our outcome variable, i.e. what we are trying to predict).

In [6]:
# separating our data into outcome and features
X = file['sqft_living']
y = file['price']
# standardise the feature and prepend a column of ones for the intercept term
x = (X - X.mean()) / X.std()
x = np.c_[np.ones(x.shape[0]), x]

As you hopefully already understand, the purpose of our gradient descent algorithm is to find the parameter values with the lowest error. In order to do this, we need a learning rate; in this case the rate is represented by alpha.

In [7]:
alpha = 0.01
iters = 2000
m = y.size
theta = np.random.rand(2)
# gradient descent
def gradient_descent(X, y, theta, iters, alpha):
    m = y.size  # number of training samples
    costs_history = []
    theta_history = [theta]
    for i in range(iters):
        preds = np.dot(X, theta)
        error = preds - y
        cost = 1 / (2 * m) * np.dot(error.T, error)
        costs_history.append(cost)
        theta = theta - (alpha * (1 / m) * np.dot(X.T, error))
        theta_history.append(theta)
    return theta_history, costs_history
theta_history, costs_history = gradient_descent(x, y, theta, iters, alpha)
theta = theta_history[-1]

Okay, so what did we just do? Firstly, let's define our inputs. alpha is our learning rate; iters is the number of times we run our gradient descent update in order to continually minimize our cost function until improvements flatten out; theta holds the coefficients of our variables (we start off with random numbers); and m represents the total number of samples in our dataset. On each iteration we find the error between our predicted outcomes and the actual outcomes using the current theta values. We then repeat this process, updating theta by a step scaled by our learning rate so as to reduce the error with each iteration.

We also append the cost and theta values at each iteration so that we can track how our cost function changes. This will help us see how gradient descent minimizes our cost function with each iteration.

Let's plot our cost history to see the improvements.

In [8]:
plt.title('Cost function History')
plt.xlabel('Number of iterations')
plt.ylabel('Cost Value')
plt.plot(costs_history)
plt.show()

When we plot the cost history we can see the cost value decreasing with each iteration with the improvements eventually flattening out.

In [19]:
!pip install ffmpeg-python
Requirement already satisfied (use --upgrade to upgrade): ffmpeg-python in c:\users\jasmi\anaconda3\lib\site-packages
Requirement already satisfied (use --upgrade to upgrade): future in c:\users\jasmi\anaconda3\lib\site-packages (from ffmpeg-python)
In [23]:
import matplotlib as mpl
import ffmpeg
mpl.matplotlib_fname()
mpl.rcParams['animation.convert_path'] = r'C:\Program Files\ImageMagick-6.9.2-Q16-HDRI\convert.exe'
In [19]:
from matplotlib import animation, rc
# set the plot up
fig = plt.figure()
ax = plt.axes()
plt.title('Sale Price vs Living Space')
plt.xlabel('Living space in square feet (normalised)')
plt.ylabel('Price of sale in $')
plt.scatter(x[:, 1], y, color='black')
line, = ax.plot([], [], lw=2)
annotation = ax.text(-1, 700000, '')
annotation.set_animated(True)
plt.close()
# generate the animation data
def init():
    line.set_data([], [])
    annotation.set_text('')
    return line, annotation
def animate(i):
    x = np.linspace(-5, 20, 1000)
    y = theta_history[i][1]*x + theta_history[i][0]
    line.set_data(x, y)
    annotation.set_text('Cost = %.2f e10' % (costs_history[i]/10000000000))
    return line, annotation
anim = animation.FuncAnimation(fig, animate, init_func=init,
                               frames=300, interval=0, blit=True)
anim.save('animation.gif', writer='imagemagick', fps = 30)
MovieWriter imagemagick unavailable; trying to use <class 'matplotlib.animation.PillowWriter'> instead.
In [20]:
#Display the animation...
import io
import base64
from IPython.display import HTML

filename = 'animation.gif'

video = io.open(filename, 'r+b').read()
encoded = base64.b64encode(video)
HTML(data='''<img src="data:image/gif;base64,{0}" type="gif" />'''.format(encoded.decode('ascii')))
Out[20]:

Have a look at this animated plot to understand how gradient descent optimizes our best fit line by iteratively minimizing the cost; in other words, learning from the error between our predictions and the actual values and modifying our coefficients to improve our predictions.
